{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "title: [转]吊炸天的中文自然语言处理工具和语料库介绍\n",
    "date: 2018-06-02 18:17:55\n",
    "tags: [python, 自然语言处理, NLP]\n",
    "toc: true\n",
    "xiongzhang: true\n",
    "xiongzhang_images: [images/awesome-chinese-nlp.jpg]\n",
    "\n",
    "---\n",
    "<span></span>\n",
    "<!-- more -->\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "[![Awesome](https://awesome.re/badge.svg)](https://mlln.cn)\n",
    "\n",
    "\n",
    "![](images/awesome-chinese-nlp.jpg)\n",
    "  \n",
    "  \n",
    "  \n",
    "<br />\n",
    "<br />\n",
    "\n",
    "## Chinese NLP Toolkits 中文NLP工具\n",
    "\n",
    "### Toolkits 综合NLP工具包\n",
    "\n",
    "- [THULAC 中文词法分析工具包](http://thulac.thunlp.org/) by 清华 (C++/Java/Python)\n",
    "\n",
    "- [NLPIR](https://github.com/NLPIR-team/NLPIR) by 中科院 (Java)\n",
    "\n",
    "- [LTP 语言技术平台](https://github.com/HIT-SCIR/ltp) by 哈工大 (C++)\n",
    "\n",
    "- [FudanNLP](https://github.com/FudanNLP/fnlp) by 复旦 (Java)\n",
    "\n",
    "- [BosonNLP](http://bosonnlp.com/) by Boson (商业API服务)\n",
    "\n",
    "- [HanLP](https://github.com/hankcs/HanLP) (Java)\n",
    "\n",
    "- [SnowNLP](https://github.com/isnowfy/snownlp) (Python) Python library for processing Chinese text\n",
    "\n",
    "- [YaYaNLP](https://github.com/Tony-Wang/YaYaNLP) (Python) 纯python编写的中文自然语言处理包，取名于“牙牙学语”\n",
    "\n",
    "- [小明NLP](https://github.com/SeanLee97/xmnlp) (Python) 轻量级中文自然语言处理工具\n",
    "\n",
    "- [DeepNLP](https://github.com/rockingdingo/deepnlp) (Python) Tensorflow实现的自然语言处理工具, 加持预训练的中文模型.\n",
    "\n",
    "- [chinese_nlp](https://github.com/taozhijiang/chinese_nlp) (C++ & Python) 中文自然语言处理工具和案例\n",
    "\n",
    "- [Chinese-Annotator](https://github.com/crownpku/Chinese-Annotator) (Python)  中文文本标注工具\n",
    "\n",
    "### Popular NLP Toolkits for English/Multi-Language 常用的英文或支持多语言的NLP工具包\n",
    "\n",
    "- [CoreNLP](https://github.com/stanfordnlp/CoreNLP) by Stanford (Java)Java核心NLP工具套件。\n",
    "\n",
    "- [NLTK](http://www.nltk.org/) (Python) 自然语言工具包\n",
    "\n",
    "- [spaCy](https://spacy.io/) (Python) 工业水准的自然语言处理\n",
    "\n",
    "- [textacy](https://github.com/chartbeat-labs/textacy) (Python) NLP, before and after spaCy\n",
    "\n",
    "- [OpenNLP](https://opennlp.apache.org/) (Java) 基于机器学习的工具包，用于处理自然语言文本。\n",
    "\n",
    "- [gensim](https://github.com/RaRe-Technologies/gensim) (Python) Gensim是一个用于主题建模，文档索引和大型语料库相似性检索的Python库。\n",
    "\n",
    "\n",
    "  \n",
    "### Chinese Word Segment 中文分词\n",
    "\n",
    "- [Jieba 结巴中文分词](https://github.com/fxsjy/jieba) (Python及大量其它编程语言衍生) 做最好的 Python 中文分词组件\n",
    "\n",
    "- [kcws 深度学习中文分词](https://github.com/koth/kcws) (Python) BiLSTM+CRF与IDCNN+CRF\n",
    "\n",
    "- [ID-CNN-CWS](https://github.com/hankcs/ID-CNN-CWS) (Python) 基于迭代卷积神经网络的中文分词\n",
    "\n",
    "- [Genius 中文分词](https://github.com/duanhongyi/genius) (Python) Genius是一个开源的python中文分词组件，采用 CRF(Conditional Random Field)条件随机场算法。\n",
    "\n",
    "- [loso 中文分词](https://github.com/fangpenlin/loso) (Python)\n",
    "\n",
    "- [yaha \"哑哈\"中文分词](https://github.com/jannson/yaha) (Python)\n",
    "\n",
    "- [ChineseWordSegmentation](https://github.com/Moonshile/ChineseWordSegmentation) (Python) 无需语料库的中文分词\n",
    "\n",
    "  \n",
    "### Information Extraction 信息提取\n",
    "\n",
    "- [MITIE](https://github.com/mit-nlp/MITIE) (C++) 信息提取工具\n",
    "\n",
    "- [Duckling](https://github.com/facebookincubator/duckling) (Haskell) 用于表达，测试和评估输入字符串的可组合语言规则的语言，引擎和工具.\n",
    "\n",
    "- [IEPY](https://github.com/machinalis/iepy) (Python)  IEPY是专注于关系抽取的信息抽取的开源工具。\n",
    "\n",
    "- [Snorkel](https://github.com/HazyResearch/snorkel)一个专注于信息提取的培训数据创建和管理系统\n",
    "\n",
    "- [Neural Relation Extraction implemented with LSTM in TensorFlow](https://github.com/thunlp/TensorFlow-NRE)\n",
    "\n",
    "- [A neural network model for Chinese named entity recognition](https://github.com/zjy-ucas/ChineseNER)\n",
    "\n",
    "- [Information-Extraction-Chinese](https://github.com/crownpku/Information-Extraction-Chinese) Chinese Named Entity Recognition with IDCNN/biLSTM+CRF, and Relation Extraction with biGRU+2ATT 中文实体识别与关系提取\n",
    "\n",
    "- [Familia](https://github.com/baidu/Familia) 百度出品的 工业主题建模工具包\n",
    "\n",
    "- [Text Classification](https://github.com/brightmart/text_classification)各种文本分类模型，更有深度学习的加持。 用知乎问答语聊作为测试数据。\n",
    "\n",
    "  \n",
    "### QA & Chatbot 问答和聊天机器人 \n",
    "\n",
    "- [Rasa NLU](https://github.com/RasaHQ/rasa_nlu) (Python) 将自然语言转化为结构化数据, a Chinese fork at [Rasa NLU Chi](https://github.com/crownpku/Rasa_NLU_Chi)\n",
    "\n",
    "- [Rasa Core](https://github.com/RasaHQ/rasa_core) (Python) 基于机器学习的对话式软件对话引擎\n",
    "\n",
    "- [Snips NLU](https://github.com/snipsco/snips-nlu) (Python) Snips NLU是一个Python库，允许解析用自然语言编写的句子并提取结构化信息。\n",
    "\n",
    "- [DeepPavlov](https://github.com/deepmipt/DeepPavlov) (Python) 一个用于构建端到端对话系统和培训chatbots的开源库。\n",
    "\n",
    "- [ChatScript](https://github.com/bwilcox-1234/ChatScript)自然语言工具/对话管理器，基于规则的聊天机器引擎。\n",
    "\n",
    "- [Chatterbot](https://github.com/gunthercox/ChatterBot) (Python) ChatterBot是一个机器学习，用于创建聊天机器人的对话式对话引擎。\n",
    "\n",
    "- [Chatbot](https://github.com/zake7749/Chatbot) (Python) 基於向量匹配的情境式聊天機器人\n",
    "\n",
    "- [Tipask](https://github.com/sdfsky/tipask) (PHP) 一款开放源码的PHP问答系统，基于Laravel框架开发，容易扩展，具有强大的负载能力和稳定性。\n",
    "\n",
    "- [QuestionAnsweringSystem](https://github.com/ysc/QuestionAnsweringSystem) (Java) 一个Java实现的人机问答系统，能够自动分析问题并给出候选答案。\n",
    "\n",
    "- [QA-Snake](https://github.com/SnakeHacker/QA-Snake) (Python) 基于多搜索引擎和深度学习技术的自动问答\n",
    "\n",
    "- [使用TensorFlow实现的Sequence to Sequence的聊天机器人模型](https://github.com/qhduan/Seq2Seq_Chatbot_QA) (Python)\n",
    "\n",
    "- [使用深度学习算法实现的中文阅读理解问答系统](https://github.com/S-H-Y-GitHub/QA) (Python)\n",
    "\n",
    "- [DuReader中文阅读理解Baseline代码](https://github.com/baidu/DuReader) (Python)\n",
    "\n",
    "- [基于SmartQQ的自动机器人框架](https://github.com/Yinzo/SmartQQBot) (Python)\n",
    "\n",
    "<br />\n",
    "<br />\n",
    "\n",
    "## Corpus 中文语料\n",
    "\n",
    "- [开放知识图谱OpenKG.cn](http://openkg.cn)\n",
    "\n",
    "- [开放中文知识图谱的schema](https://github.com/cnschema/cnschema)\n",
    "\n",
    "- [大规模中文概念图谱CN-Probase](http://kw.fudan.edu.cn/cnprobase/search/) [公众号介绍](https://mp.weixin.qq.com/s?__biz=MzI0MTI1Nzk1MA==&mid=2651675884&idx=1&sn=1a43a93fd5bb53c8a9e48518bfa41db8&chksm=f2f7a05dc580294b227332b1051bfa2e5c756c72efb4d102c83613185b571ac31343720a6eae&mpshare=1&scene=1&srcid=1113llNDS1MvoadhCki83ERW#rd)\n",
    "\n",
    "- [农业知识图谱](https://github.com/qq547276542/Agriculture_KnowledgeGraph) 农业领域的信息检索，命名实体识别，关系抽取，分类树构建，数据挖掘\n",
    "\n",
    "- [CLDC中文语言资源联盟](http://www.chineseldc.org/)\n",
    "\n",
    "- [中文 Wikipedia Dump](https://dumps.wikimedia.org/zhwiki/)\n",
    "\n",
    "- [98年人民日报词性标注库@百度盘](https://pan.baidu.com/s/1gd6mslt)\n",
    "\n",
    "- [搜狗20061127新闻语料(包含分类)@百度盘](https://pan.baidu.com/s/1bnhXX6Z)\n",
    "\n",
    "- [UDChinese](https://github.com/UniversalDependencies/UD_Chinese) (for training spaCy POS)\n",
    "\n",
    "- [中文word2vec模型](https://github.com/to-shimo/chinese-word2vec)\n",
    "\n",
    "- [上百种预训练中文词向量](https://github.com/Embedding/Chinese-Word-Vectors)\n",
    "\n",
    "- [Synonyms:中文近义词工具包](https://github.com/huyingxi/Synonyms/) 基于维基百科中文和word2vec训练的近义词库，封装为python包文件。\n",
    "\n",
    "- [Chinese_conversation_sentiment](https://github.com/z17176/Chinese_conversation_sentiment)中文语义数据集可能对语义分析有用。\n",
    "\n",
    "- [中文突发事件语料库](https://github.com/shijiebei2009/CEC-Corpus) Chinese Emergency Corpus\n",
    "\n",
    "- [dgk_lost_conv 中文对白语料](https://github.com/rustch3n/dgk_lost_conv) 中文对白语料库\n",
    "\n",
    "- [用于训练中英文对话系统的语料库](https://github.com/candlewill/Dialog_Corpus) 用于训练机器人的语料库\n",
    "\n",
    "- [八卦版問答中文語料](https://github.com/zake7749/Gossiping-Chinese-Corpus)\n",
    "\n",
    "- [中国股市公告信息爬取](https://github.com/startprogress/China_stock_announcement) 通过python脚本从巨潮网络的服务器获取中国股市（sz,sh）的公告(上市公司和监管机构)\n",
    "\n",
    "- [tushare财经数据接口](http://tushare.org/) TuShare是一个免费、开源的python财经数据接口包。\n",
    "\n",
    "- [保险行业语料库](https://github.com/Samurais/insuranceqa-corpus-zh)   [[52nlp介绍Blog](http://www.52nlp.cn/%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0%E4%BF%9D%E9%99%A9%E8%A1%8C%E4%B8%9A%E9%97%AE%E7%AD%94%E5%BC%80%E6%94%BE%E6%95%B0%E6%8D%AE%E9%9B%86)] 专业领域的语料库\n",
    "\n",
    "- [最全中华古诗词数据库](https://github.com/chinese-poetry/chinese-poetry) 唐宋两朝近一万四千古诗人, 接近5.5万首唐诗加26万宋诗. 两宋时期1564位词人，21050首词。\n",
    "\n",
    "- [DuReader中文阅读理解数据](http://ai.baidu.com/broad/subordinate?dataset=dureader) \n",
    "\n",
    "- [中文语料小数据](https://github.com/crownpku/Small-Chinese-Corpus) 包含了中文命名实体识别、中文关系识别、中文阅读理解等一些小量数据\n",
    "\n",
    "- [中文人名语料库](https://github.com/wainshine/Chinese-Names-Corpus) 中文姓名,姓氏,名字,称呼,日本人名,翻译人名,英文人名。\n",
    "\n",
    "- [中文敏感词词库](https://github.com/observerss/textfilter) 敏感词过滤的几种实现+某1w词敏感词库\n",
    "\n",
    "- [中文简称词库](https://github.com/zhangyics/Chinese-abbreviation-dataset) 中文缩写的一个语料库, including negative full forms.   \n",
    "\n",
    "- [中文数据预处理材料](https://github.com/dongxiexidian/Chinese) 中文分词词典和中文停用词\n",
    "\n",
    "- [漢語拆字字典](https://github.com/kfcd/chaizi)\n",
    "\n",
    "- [SentiBridge: 中文实体情感知识库](https://github.com/rainarch/SentiBridge) 刻画人们如何描述某个实体，包含新闻、旅游、餐饮，共计30万对。\n",
    "\n",
    "- [OpenCorpus](https://github.com/hankcs/OpenCorpus) 免费提供的（中文）语料库。\n",
    "  \n",
    "<br />\n",
    "<br />\n",
    "\n",
    "## Organizations 相关中文NLP组织和会议\n",
    "\n",
    "- [中国中文信息学会](http://www.cipsc.org.cn/)\n",
    "\n",
    "- [NLP Conference Calender](http://cs.rochester.edu/~omidb/nlpcalendar/) NLP社区的主要会议，期刊，研讨会和共享任务。\n",
    "  \n",
    "<br />\n",
    "<br />\n",
    "\n",
    "## Learning Materials 学习资料\n",
    "\n",
    "- [中文Deep Learning Book](https://github.com/exacity/deeplearningbook-chinese)\n",
    "\n",
    "- [Stanford CS224n Natural Language Processing with Deep Learning 2017](http://web.stanford.edu/class/cs224n/syllabus.html)\n",
    "\n",
    "- [Oxford CS DeepNLP 2017](https://github.com/oxford-cs-deepnlp-2017)\n",
    "\n",
    "- [Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/) by Dan Jurafsky and James H. Martin\n",
    "\n",
    "- [52nlp 我爱自然语言处理](http://www.52nlp.cn/)\n",
    "\n",
    "- [hankcs 码农场](http://www.hankcs.com/)\n",
    "\n",
    "- [文本处理实践课资料](https://github.com/Roshanson/TextInfoExp) 文本处理实践课资料，包含文本特征提取（TF-IDF），文本分类，文本聚类，word2vec训练词向量及同义词词林中文词语相似度计算、文档自动摘要，信息抽取，情感分析与观点挖掘等实验。\n",
    "\n",
    "- [nlp_tasks](https://github.com/Kyubyong/nlp_tasks) 自然语言处理任务和选定参考\n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
